Creating px-files in R, using PXJob

PXEdit/PXJob are fantastic tools to create px-files, manually or automated, as it brings data and metadata together for dissemination with PXWeb/PXWin/PCAxis.

This document presents an example on how to use PXJob to transform a structural file into px-files. Any structural dataset would do, but as example log-files from PXWeb on the statbank server at Statistics Greenland is used. Some logfiles have been included in the github, R-package: Github.com/StatisticsGreenland/statbanklog.

Example: PXWeb-logfiles

‘PXWeb-out-of-the-box’ does not offer usage statistics of any kind. Most likely all PXWeb-based services regularly need to deliver management information as PXWeb is vital to many statistical offices dissemination policy. But PXWeb does store the information needed in logfiles. And the easiest methods to disseminate information is naturally with PXWeb.

The tricky part is transfering the information to px-files. But as this example will demonstrate, it is not all that difficult, when using PXJob. To serve as learning material this example use R, free to all, to demonstrate how PXJob runs in a Windows environment, but the methods are transferable to other programming languages.

Read all log-files from a folder/directory

PXWeb has 3 different log-files apart from the normal web-server logs. They are stored on the server in:

~/PxWeb/logs

api.log* pxweb.log* stats.log*

In this example we take a look at stats.log, saved as one file pr day, where one line is added to the log-file for each table being extrated or manipulated in the StatBank.

When using read_log from Tidyverse the line is chopped using blanks as separator, naming the columns X1,x2,…,Xn

First all files in the logs folder are read and passed to a file called statlog

# Preface --- optional --------------------------------------------------------
# rm(list = ls()) # Empty environment
# setwd("S:/STATBANK/bank2020/log") # Set your work-directory or use the same as this RMarkdown

# Load libraries --------------------------------------------------------------
# Standard R-packages from CRAN

# A GOOD place to start reading: https://r4ds.had.co.nz/
# if(!require("tidyverse")) {install.packages("tidyverse")}
library(tidyverse)

# lubridate makes dealing with dates a little easier
# if(!require("lubridate")) {install.packages("lubridate")}
library(lubridate)

# the janitor package has functions to clean data
#if(!require("janitor")) {install.packages("janitor")}
library(janitor)

# Import data (stat* files) -----------------------------------------------------------------

# sample from Statistics Greenland (included in pxjob64Win-package)
log_dir <- c(system.file(package = "pxjob64Win","logs/logs/"))

# If you have your own set of log-files, edit path to folder here, and modify code below
# Set your log-directory (stat-input files)
 log_dir <- "S:/STATBANK/bank2020/log/pxwebLog/logs/"

# stats - log ------------------------------------------------------------------------
# Read all log-files to one file with 'bind_rows'
stats_logs_raw <- dir(path = log_dir, pattern = 'stat*', full.names = TRUE) %>% 
  map(read_log) %>% 
  reduce(bind_rows) %>% 
   as_tibble()
  

# Transform ------------------------------------------------------------------------

# Get column names (nms) from logs_raw (use function from janitor)
nms <- stats_logs_raw %>% slice(1) %>% flatten_chr() %>% 
  str_extract(".*(?==)") %>% 
  coalesce(names(stats_logs_raw)) %>% 
  make_clean_names()

statlog <- stats_logs_raw %>% 
  set_names(nms) %>% 
  rename(date = x1, hms = x2) %>% 
  select(
    # Removes columns, where there are 0/1 unique values
    where(~length(unique(.x)) > 1)
  ) %>% 
  mutate(
    # Removes all before \\, all before =, and commas at end
    across(everything(), ~ gsub(".*\\\\|.*=|,$", "", .x)),
    # Removes .px extension
    table_id = str_remove(table_id, ".px$|.PX$"),
    # Convert date to Date type
    date = ymd(date),
    y = year(date),
    month = sprintf("%02d",month(date)),
    day = day(date),
    yymm = paste(y,"M",month,sep = "")
  ) 

# 
# library(sqldf)
# variableContent <- sqldf("select distinct action_name from stats_logs")

print.data.frame(head(statlog,10))

##          date          hms x3   context language  database  action_type
## 1  2021-10-20 03:34:42,273 15 Selection       kl Greenland Presentation
## 2  2021-10-20 04:52:35,315 23 Selection       en Greenland Presentation
## 3  2021-10-20 04:53:01,050 38 Selection       en Greenland Presentation
## 4  2021-10-20 04:53:29,659 39 Selection       en Greenland Presentation
## 5  2021-10-20 05:27:56,363 32 Selection       da Greenland Presentation
## 6  2021-10-20 05:28:44,801 32 Selection       da Greenland Presentation
## 7  2021-10-20 05:29:35,958 17 Selection       da Greenland Presentation
## 8  2021-10-20 05:29:57,427 32 Selection       da Greenland Presentation
## 9  2021-10-20 05:30:03,287 18 Selection       da Greenland Presentation
## 10 2021-10-20 05:31:06,475 36 Selection       da Greenland Presentation
##         action_name table_id number_of_cells number_of_contents    y month day
## 1  tableViewLayout1    NAXT1               1                  1 2021    10  20
## 2  tableViewLayout1 BEXBBLK1               1                  1 2021    10  20
## 3  tableViewLayout1  BEXBBL3               2                  1 2021    10  20
## 4  tableViewLayout1  BEXBBL3              10                  1 2021    10  20
## 5  tableViewLayout1   TUXFLY              12                  1 2021    10  20
## 6   chartViewColumn   TUXFLY             144                  1 2021    10  20
## 7     chartViewLine   TUXFLY             144                  1 2021    10  20
## 8   chartViewColumn   TUXFLY             132                  1 2021    10  20
## 9  tableViewLayout2   TUXFLY             132                  1 2021    10  20
## 10 tableViewLayout2   TUXFLY              16                  1 2021    10  20
##       yymm
## 1  2021M10
## 2  2021M10
## 3  2021M10
## 4  2021M10
## 5  2021M10
## 6  2021M10
## 7  2021M10
## 8  2021M10
## 9  2021M10
## 10 2021M10

Subset from stats-log

The log-files holds a lot of information and one could create one px-file with all information. This will result in a very big px-file, so in this first table, only some few variables will have focus; time, matrix and language

With these filters applied:

database == “Greenland”
action_type == “Presentation”
action_name == (“tableViewLayout1” | actionName == “tableViewLayout2”)

statlog_filtered <- 
  statlog %>% 
  filter(database == "Greenland") %>% 
  filter(action_type == "Presentation") %>% 
  filter(action_name == "tableViewLayout1" | action_name == "tableViewLayout2") %>% 
  filter(context == "Selection") %>% 
  select(yymm,table_id,language) %>% 
  mutate(antal = 1) %>%
  group_by(yymm,table_id,language) %>%
  summarise(value = sum(antal)) %>%
  ungroup() 

print.data.frame(head(statlog_filtered,10))

##       yymm table_id language value
## 1  2020M11   ALXALK       da     8
## 2  2020M11   ALXALK       en     2
## 3  2020M11  ALXALK1       da     4
## 4  2020M11  ALXALK2       da     2
## 5  2020M11  ALXALK3       da     2
## 6  2020M11  ALXALK4       da    12
## 7  2020M11  ALXALK4       en     6
## 8  2020M11  ALXTOB1       da     1
## 9  2020M11  ALXTOB2       en     1
## 10 2020M11  ALXTOB3       da     3

Notice that the name of each dimension in the tibble(dataframe) is unique.

Starting with PXJob 4.0 mulitilingual px-files can be created from a (px)structural table holding data and metadata in chosen languages. Additional metadata can be added/edited in keyword blocks.

Studying and comparing a structural file to a tibble in R/Tidyverse the repetition of column-headers as variable-names has to change. Each column-name needs to be unique in a tibble.

In the PXEdit Install package for windows you can find additional information in ‘Introduction_to_structural_tables.pdf’

To accommodate the need for unique column headers in a tibble, Veli-Matti Jantunen included some few changes to PXJob, allowing extensions in variable-names:

• if the names are equal when the (last) extension is removed, they’ll become the variable names (e.g. population.da population.no)

• language codes or other meaningful texts can be used (the length is not important), but they are NOT used at this point, i.e. otherwise than recognising and removing the extension, the algorithm works as documented

• if variable names are used in keyword blocks, use the bare variable name (population)

Adding multilingual texts to the statlog tibble

In the statlog tibble the ‘language’ variable shall now support English, Danish and Greenlandic texts

In other words, the tibble-columns: time, matrix, language.code, language.en, language.da, language.gl, value

is transferred as: time, matrix, language, language, language, language, value (in the structural file)

by_language.sort <- c("da" = "1",
                      "kl" = "2",
                      "en" = "0")

by_language.en <- c("da" = "Danish",
                "kl" = "Greenlandic",
                "en" = "English")

by_language.da <- c("da" = "Dansk",
               "kl" = "Grønlandsk",
               "en" = "Engelsk")

by_language.kl <- c("da" = "Qallunaatut",
               "kl" = "Kalaallisut",
               "en" = "Tuluttut")

The multilingual tibble

Metadata from formats are now added to the statlog, filtered tibble and is then ready to be converted to a PX file with PXJob

# stats - log
slog <- statlog_filtered %>% 
  mutate(language.code = language) %>%
  mutate(language.sort = by_language.sort[language.code]) %>%
  mutate(language.en = by_language.en[language.code]) %>%
  mutate(language.da = by_language.da[language.code]) %>%
  mutate(language.kl = by_language.kl[language.code]) %>% 
  rename(time = yymm, matrix = table_id) %>% 
  arrange(time,matrix,language.sort) %>%
  select(time,matrix,language.code,language.en,language.da,language.kl,value) 

print.data.frame(head(slog))

##      time  matrix language.code language.en language.da language.kl value
## 1 2020M11  ALXALK            en     English     Engelsk    Tuluttut     2
## 2 2020M11  ALXALK            da      Danish       Dansk Qallunaatut     8
## 3 2020M11 ALXALK1            da      Danish       Dansk Qallunaatut     4
## 4 2020M11 ALXALK2            da      Danish       Dansk Qallunaatut     2
## 5 2020M11 ALXALK3            da      Danish       Dansk Qallunaatut     2
## 6 2020M11 ALXALK4            en     English     Engelsk    Tuluttut     6

PXJob with R

PXJob is a windows program and it is installed as part of the PXEdit program. If the program is installed on your pc/server you might need a newer version of core PXJob executables.

For now (January 2021) a testversion of 3 core files from PXJob has been added to a R-package on www.github.com/LarsLap/pxjob64Win (They will be updated with the final version later in 2021)

After installation in R, R will know where to find the components by this command:

pxjob <- c(system.file(package = “pxjob64Win”,“PxJob64.exe”))

This will also work in a Windows environment, where you do not have PxEdit/PXJob installed, but are allowed to add and run external R-packages

# Commands used to setup Do_Px ------------------------------------------------------------

# install.packages("devtools")
# library(devtools)
# devtools::install_github("LarsLap/pxjob64Win")

library(pxjob64Win)

Do_PX by PXJob

After a file with data has been created it is time to convert it to a PX-file. The PX-file will be a lot richer in metadata, than the values and codes, defined so far. Metadata is added in keyword blocks.

PXJob does allow all metadata added in one block. But it blows my mind.

Therefore the following code does the creation step-by-step

# path to PXJob components
pxjob <- c(system.file(package = "pxjob64Win","PxJob64.exe"))

# temporary files on disc are deleted
tmp_file = "output_tmp.csv"
ctl_file = "control.csv"



# setup Do_Px ------------------------------------------------------------

matrix = "SBXSBSTATS02"
title = c("Statbank usage, Statistics Greenland;en,da,kl")
languagecode = "languagecode;en;da;kl"
variables = "VARIABLES;time,matrix,language;tid,matrix,sprog;piffissaq,matrix,oqaatsit"
contents = "CONTENTS;count;antal;amerlassusaat"

description_en = "Statbank usage by table ID <em>[SBESTATS02]</em>;en"
description_da = "Antal udtræk per tabel <em>[SBDSTATS02]</em>;da"
description_kl = "Tabelimit ataatsimit aaneqarsinnaasut <em>[SBNSTATS02]</em>;kl"

writeLines(paste0("languages in file",";en,da,kl"),tmp_file)
write.table(slog,file=tmp_file,row.names=FALSE,append=TRUE,sep=';', fileEncoding = "utf8")
write_lines("",tmp_file, append = TRUE)
write_lines("",tmp_file, append = TRUE)
write_lines(languagecode,tmp_file, append = TRUE)
write_lines(variables,tmp_file, append = TRUE)
write_lines(contents,tmp_file, append = TRUE)



# convert to px. Read the tutorials :-)
# Saving switch !C combine codes to value texts ( csv and xls)
# switch c?: character conversion, -c1 = Unicode ( UTF-8 )
# switch -f* -f=fill, not summarize: *="..", calculate: *="-"
system(paste(pxjob, " px" , " in=",tmp_file," out=",matrix,".px -c1 -f- !C",sep=""))

## [1] 0

if (file.exists(tmp_file)) {file.remove(tmp_file)}

## [1] TRUE

# Pivot
# Do control file
writeLines("STUB;HEADING",ctl_file)
write_lines("matrix,language;time",ctl_file, append = TRUE)
system(paste0(pxjob, " px" , " in=", matrix, ".px", " meta=", ctl_file))

## [1] 0

#Delete file if it exists
if (file.exists(ctl_file)) {file.remove(ctl_file)}

## [1] TRUE

# END Pivot

# Matrix, Subject-code
# Do control file
writeLines("MATRIX;SUBJECT-CODE",ctl_file)
write_lines(matrix,ctl_file, append = TRUE)
write_lines(";SB",ctl_file, append = TRUE)

system(paste(pxjob, " px" , " in=", matrix, ".px", " meta=", ctl_file, " -a", sep=""))

## [1] 0

#Delete file if it exists
if (file.exists(ctl_file)) {file.remove(ctl_file)}

## [1] TRUE

# END Matrix

# Subject-area, units
# Do control file
writeLines("SUBJECT-AREA;UNITS;languagecode",ctl_file)
write_lines("StatBank;;en",ctl_file, append = TRUE)
write_lines("Statistikbank;;da",ctl_file, append = TRUE)
write_lines("Kisitsisaataasivik;;kl",ctl_file, append = TRUE)
write_lines(";count;en",ctl_file, append = TRUE)
write_lines(";antal;da",ctl_file, append = TRUE)
write_lines(";count;kl",ctl_file, append = TRUE)

system(paste0(pxjob, " px" , " in=", matrix, ".px", " meta=", ctl_file, " -a"))

## [1] 0

#Delete file if it exists
if (file.exists(ctl_file)) {file.remove(ctl_file)}

## [1] TRUE

# END Subject-area, units


# Description
# Do control file
writeLines("DESCRIPTION;languagecode",ctl_file)
write_lines(description_en,ctl_file, append = TRUE)
write_lines(description_da,ctl_file, append = TRUE)
write_lines(description_kl,ctl_file, append = TRUE)

system(paste0(pxjob, " px" , " in=", matrix, ".px", " meta=", ctl_file, " -a"))

## [1] 0

#Delete file if it exists
if (file.exists(ctl_file)) {file.remove(ctl_file)}

## [1] TRUE

# END Description

# Timeval
# Do control file
 write_lines("TIMEVAL;M",ctl_file, append = TRUE)
 write_lines("variablename;time",ctl_file, append = TRUE)
#
 system(paste0(pxjob, " px" , " in=", matrix, ".px", " meta=", ctl_file, " -a"))

## [1] 0

# 
# #Delete file if it exists
 if (file.exists(ctl_file)) {file.remove(ctl_file)}

## [1] TRUE

# # END Description

# Elimination
# Do control file
writeLines("ELIMINATION;languagecode;variablename",ctl_file)
write_lines("YES;en;matrix",ctl_file, append = TRUE)
write_lines("YES;da;matrix",ctl_file, append = TRUE)
write_lines("YES;kl;matrix",ctl_file, append = TRUE)

write_lines("YES;en;language",ctl_file, append = TRUE)
write_lines("YES;da;sprog",ctl_file, append = TRUE)
write_lines("YES;kl;oqaatsit",ctl_file, append = TRUE)
#
system(paste0(pxjob, " px" , " in=", matrix, ".px", " meta=", ctl_file, " -a"))

## [1] 0

#Delete file if it exists
if (file.exists(ctl_file)) {file.remove(ctl_file)}

## [1] TRUE

# END Elimination

How to read api & pxweb log files

# # api log files ------------------------------------------------------------
# 
# read_log2 <- function(filename, pattern, ...) {
#   read_lines(filename) %>% 
#     keep(~str_detect(.x, pattern)) %>% 
#     read_log(...)
# }
# 
# api_logs_raw <- 
#   dir(path = log_dir, pattern = 'api*', full.names = TRUE) %>% 
#   map(read_log2, pattern = "^20") %>% 
#   reduce(bind_rows) 
# 
# 
# # Get column names (nms) from logs_raw (use function from janitor)
# api_nms <- api_logs_raw %>% slice(1) %>% flatten_chr() %>% 
#   str_extract(".*(?==)") %>% 
#   coalesce(names(api_logs_raw)) %>% 
#   make_clean_names()
# 
# api_logs <- api_logs_raw %>% 
#   set_names(api_nms) %>% 
#   rename(date = x1, hms = x2) %>% 
#   select(
#     # Removes columns, where there are 0/1 unique values
#     where(~length(unique(.x)) > 1)
#   ) %>% 
#   mutate(
#     # Removes all before \\, all before =, and commas at end
#     across(everything(), ~ gsub(".*\\\\|.*=|,$", "", .x)),
#     # Removes .px extension
#     table_id = str_remove(url, ".px$|.PX$"),
#     table_id = str_remove(table_id, "/api/v1/"),
#     lang = substring(table_id,1,2),
#     tjek = str_remove(url, "/api/v1/kl/Greenland/|/api/v1/da/Greenland/|/api/v1/en/Greenland/"),
#     subcode = substring(tjek,1,2),
#     # Convert date to Date type
#     date = ymd(date),
#     y = year(date),
#     month = sprintf("%02d",month(date)),
#     day = day(date),
#     yymm = paste(y,"M",month,sep = "")
#   ) %>% 
#   select(y,month,day,yymm,lang,url,subcode,tjek,type,caller,cached) %>% 
#   filter(subcode != '/a', type == 'data', subcode != '') %>% 
#       mutate(tjek2 = str_remove(tjek,".*/"),
#              matrix = str_remove(tjek2, ".px$|.PX$") ) %>% 
#   select(-tjek,-tjek2,-url,-type,-cached,-lang,-y,-month,-day)
# 
# 
# # pxweb log files ------------------------------------
# 
# pxweb_logs_raw <- 
#   dir(path = log_dir, pattern = 'pxweb*', full.names = TRUE) %>% 
#   map(read_log2, pattern = "^20") %>% 
#   reduce(bind_rows) 
# 
# 
# pxweb_log_type <- pxweb_logs_raw %>% 
#   select(X1,X4) %>% 
#   mutate(
#     # Convert date to Date type
#     date = ymd(X1),
#     y = year(date),
#     month = sprintf("%02d",month(date)),
#     day = day(date),
#     yymm = paste(y,"M",month,sep = "")
#   ) %>% 
#   select(yymm,X4) %>% 
#   rename(type = X4)
# 
#

Pxweb stat-log to px, with R

LarsP

Last run 20-10-2021